knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
textmineR was created with three principles in mind: maximize interoperability within R's ecosystem, achieve scalability, and use syntax that is idiomatic to R.
R has many packages for text mining and natural language processing (NLP). The CRAN task view on natural language processing lists 53 unique packages. Some of these packages are interoperable. Some are not.
textmineR strives for maximum interoperability in three ways. First, it uses the dgCMatrix class from the popular Matrix package for document term matrices (DTMs) and term co-occurrence matrices (TCMs). Matrix is an R "recommended" package with nearly 500 packages that depend on, import, or suggest it. Compare that to the slam package used by tm and its derivatives: slam has an order of magnitude fewer dependents and is simply not as well integrated. Matrix also has methods that make the syntax for manipulating its matrices nearly identical to base R, which greatly reduces the cognitive burden on programmers.
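A small illustration of that base-like syntax on a random sparse matrix:

library(Matrix)

# a random 5 x 5 sparse matrix of class dgCMatrix
m <- rsparsematrix(5, 5, density = 0.4)

m[ 1:2 , ]   # subsetting works as in base R
colSums(m)   # so do colSums, rowSums, t(), %*%, and friends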
Second, textmineR relies on base R objects for corpus and metadata storage; more precisely, it relies on the user to do so. textmineR's core functions CreateDtm and CreateTcm take a simple character vector as input. Users may store their corpora as character vectors, lists, or data frames. There is no need to learn a new 'Corpus' class.
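As a minimal sketch (the two toy documents are made up for illustration):

library(textmineR)

# a corpus is just a named character vector
docs <- c(doc_1 = "The cat sat on the mat.",
          doc_2 = "Dogs and cats living together.")

# a DTM straight from the character vector, no Corpus object required
dtm_small <- CreateDtm(doc_vec = docs, doc_names = names(docs))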
Third and last, textmineR represents the output of topic models in a consistent way: a list containing two matrices. This is described in more detail in the next section. Several topic models are supported, and the simple representation means that textmineR's utility functions are usable with outputs from other packages, so long as they are represented as matrices of probabilities. (Again, see the next section for more detail.)
textmineR achieves scalability through three means. First, sparse matrices (like the dgCMatrix) offer significant memory savings. Second, textmineR utilizes Rcpp throughout for speedups. Finally, textmineR uses parallel processing by default where possible. textmineR offers a function, TmParallelApply, which implements a framework for parallel processing that is syntactically agnostic between Windows and Unix-like operating systems. TmParallelApply is used liberally within textmineR and is exposed for users.
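A minimal sketch of TmParallelApply on a toy task; the cpus argument is assumed here to cap the number of cores used, and export is only needed on Windows (as in the model-tuning example later in this vignette):

# apply a function over a list in parallel; returns a list, like lapply
squares <- TmParallelApply(X = 1:10,
                           FUN = function(x) x ^ 2,
                           cpus = 2) # assumed to cap the number of cores used

unlist(squares)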
textmineR does make some tradeoffs of performance for syntactic simplicity. textmineR is designed to run on a single node in a cluster computing environment. It can (and will by default) use all available cores of that node. If performance is your number one concern, see text2vec. textmineR uses some of text2vec's functionality under the hood.
textmineR strives for syntax that is idiomatic to R. This is, admittedly, a nebulous concept. textmineR does not create new classes where existing R classes exist. It strives for a functional programming paradigm. And it attempts to group closely-related sequential steps into single functions. This means that users will not have to make several temporary objects along the way. As an example, compare making a document term matrix in textmineR (example below) with tm or text2vec.
As a side note: textmineR's framework for NLP does not need to be exclusive to textmineR. Text mining packages in R can be interoperable by following a few conventions. First, use dgCMatrix for DTMs and TCMs. Second, write most text mining models so that they can take a dgCMatrix as input. Finally, keep non-base R classes to a minimum, especially for corpus and metadata management. For the most part, text2vec and tidytext adhere to these principles.
The basic object of analysis for most text mining applications is a document term matrix, or DTM. This is a matrix where every row represents a document and every column represents a token (word, bi-gram, stem, etc.).
You can create a DTM with textmineR by passing a character vector. There are options for stopword removal, creation of n-grams, and other standard data cleaning. There is an option for passing a stemming or lemmatization function if you desire. (See help(CreateDtm) for an example using Porter's word stemmer.)
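A hedged sketch of that option, reusing the toy docs vector from above and assuming the stem_lemma_function argument together with SnowballC's Porter stemmer, along the lines of the help example (check help(CreateDtm) for the exact signature in your version):

# pass a function that maps a character vector of tokens to their stems
dtm_stemmed <- CreateDtm(doc_vec = docs,
                         doc_names = names(docs),
                         stem_lemma_function = function(x) SnowballC::wordStem(x, language = "porter"))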
The code below uses a dataset of movie reviews included with the text2vec package. This dataset is used for sentiment analysis. In addition to the text of the reviews, there is a binary variable indicating positive or negative sentiment. More on this later...
library(textmineR)

# load movie_review dataset from text2vec
data(movie_review, package = "text2vec")

str(movie_review)

# create a document term matrix
dtm <- CreateDtm(doc_vec = movie_review$review, # character vector of documents
                 doc_names = movie_review$id, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(tm::stopwords("english"), # stopwords from tm
                                  tm::stopwords("smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE) # numbers - this is the default

dim(dtm) # 5,000 documents and 424,926 tokens
The code below performs some basic corpus statistics. textmineR has a built-in function, TermDocFreq, for getting term frequencies across the corpus. It returns term frequencies (equivalent to colSums(dtm)), the number of documents in which each term appears (equivalent to colSums(dtm > 0)), and an inverse document frequency (IDF) vector. The IDF vector can be used to create a TF-IDF matrix.
# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)

str(tf_mat)

# look at the most frequent tokens
head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)

# look at the most frequent bigrams
tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]

head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)

# it looks like we have stray html tags (<br>) that weren't stripped
# from our documents. Let's fix that.
dtm <- dtm[ , ! stringr::str_detect(colnames(dtm), "^br$|_br$|^br_") ]

tf_mat <- TermDocFreq(dtm)

tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]

head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)

head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)

# summary of document lengths
doc_lengths <- rowSums(dtm)

summary(doc_lengths)
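As noted above, the idf column returned by TermDocFreq can be used to weight the DTM into a TF-IDF matrix. A minimal sketch, mirroring the construction used in the LSA section below (the object name tf_idf_full is arbitrary):

# weight normalized term frequencies by IDF to get TF-IDF
idf <- tf_mat$idf

idf[ is.infinite(idf) ] <- 0 # guard against terms with zero document frequency

tf_idf_full <- t(dtm / rowSums(dtm)) * idf

tf_idf_full <- t(tf_idf_full)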
Often, it's useful to prune your vocabulary and remove any tokens that appear in a small number of documents. This will greatly reduce the vocabulary size (see Zipf's law) and improve computation time.
# remove any tokens that were in 3 or fewer documents
dtm <- dtm[ , colSums(dtm > 0) > 3 ] # alternatively: dtm[ , tf_mat$doc_freq > 3 ]

tf_mat <- tf_mat[ tf_mat$term %in% colnames(dtm) , ]

tf_bigrams <- tf_bigrams[ tf_bigrams$term %in% colnames(dtm) , ]
You can get a lot of mileage out of simple corpus statistics. The code below uses simple probabilistic reweighting (instead of TF-IDF) to see the unigrams and bigrams most-associated with positive and negative sentiment.
# what words are most associated with sentiment?
tf_sentiment <- list(positive = TermDocFreq(dtm[ movie_review$sentiment == 1 , ]),
                     negative = TermDocFreq(dtm[ movie_review$sentiment == 0 , ]))

# these are basically the same, not helpful
head(tf_sentiment$positive[ order(tf_sentiment$positive$term_freq, decreasing = TRUE) , ], 10)

head(tf_sentiment$negative[ order(tf_sentiment$negative$term_freq, decreasing = TRUE) , ], 10)

# let's reweight by probability by class
p_words <- colSums(dtm) / sum(dtm) # alternatively: tf_mat$term_freq / sum(tf_mat$term_freq)

tf_sentiment$positive$conditional_prob <-
  tf_sentiment$positive$term_freq / sum(tf_sentiment$positive$term_freq)

tf_sentiment$positive$prob_lift <- tf_sentiment$positive$conditional_prob - p_words

tf_sentiment$negative$conditional_prob <-
  tf_sentiment$negative$term_freq / sum(tf_sentiment$negative$term_freq)

tf_sentiment$negative$prob_lift <- tf_sentiment$negative$conditional_prob - p_words

# let's look again with new weights
head(tf_sentiment$positive[ order(tf_sentiment$positive$prob_lift, decreasing = TRUE) , ], 10)

head(tf_sentiment$negative[ order(tf_sentiment$negative$prob_lift, decreasing = TRUE) , ], 10)

# what about bi-grams?
tf_sentiment_bigram <- lapply(tf_sentiment, function(x){
  x <- x[ stringr::str_detect(x$term, "_") , ]
  x[ order(x$prob_lift, decreasing = TRUE) , ]
})

head(tf_sentiment_bigram$positive, 10)

head(tf_sentiment_bigram$negative, 10)
textmineR has extensive functionality for topic modeling. You can fit Latent Dirichlet Allocation (LDA), Correlated Topic Models (CTM), and Latent Semantic Analysis (LSA) from within textmineR. (Examples with LDA and LSA follow below.) As of this writing, textmineR's LDA and CTM functions are wrappers for other packages, provided to facilitate a consistent workflow. Plans exist to implement LDA natively with Rcpp sometime in 2018.
textmineR's consistent representation of topic models boils down to two matrices. The first, "theta" ($\Theta$), has rows representing a distribution of topics over documents. The second, "phi" ($\Phi$), has rows representing a distribution of words over topics. In the case of probabilistic models, these rows are categorical probability distributions. In the case of non-probabilistic models (e.g. LSA), these distributions are, obviously, not probabilities. In the case of LSA, for example, there is a third object representing the singular values of the decomposition.
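For probabilistic models, this representation can be checked directly once a model has been fit (see the FitLdaModel example below): every row of theta and of phi sums to one.

# assumes `model` is a fitted probabilistic topic model, e.g. from FitLdaModel below
summary(rowSums(model$theta)) # P(topic | document); each row sums to 1
summary(rowSums(model$phi))   # P(word | topic); each row sums to 1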
In addition, textmineR has utility functions for topic models. These include some original research: an R-squared for probabilistic topic models (working paper here), probabilistic coherence (a measure of topic quality), and a topic labeling function based on most-probable bigrams. Other utilities are demonstrated below.
# start with a sample of 500 documents so our example doesn't take too long
dtm_sample <- dtm[ sample(1:nrow(dtm), 500) , ]

# Fit a Latent Dirichlet Allocation model
# note the number of topics is arbitrary here
# see extensions for more info
model <- FitLdaModel(dtm = dtm_sample,
                     k = 100,
                     iterations = 800,
                     alpha = 0.1, # this is the default value
                     beta = 0.05) # this is the default value

# two matrices:
# theta = P(topic | document)
# phi = P(word | topic)
str(model)

# R-squared
# - only works for probabilistic models like LDA and CTM
model$r2 <- CalcTopicModelR2(dtm = dtm_sample,
                             phi = model$phi,
                             theta = model$theta)

model$r2

# log Likelihood (does not consider the prior)
# - only works for probabilistic models like LDA and CTM
model$ll <- CalcLikelihood(dtm = dtm_sample,
                           phi = model$phi,
                           theta = model$theta)

model$ll

# probabilistic coherence, a measure of topic quality
# - can be used with any topic model, e.g. LSA
model$coherence <- CalcProbCoherence(phi = model$phi, dtm = dtm_sample, M = 5)

summary(model$coherence)

hist(model$coherence, col = "blue")

# Get the top terms of each topic
model$top_terms <- GetTopTerms(phi = model$phi, M = 5)

head(t(model$top_terms))

# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents.
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100

# textmineR has a naive topic labeling tool based on probable bigrams
model$labels <- LabelTopics(assignments = model$theta > 0.05,
                            dtm = dtm_sample,
                            M = 1)

head(model$labels)

# put them together, with coherence into a summary table
model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence, 3),
                            prevalence = round(model$prevalence, 3),
                            top_terms = apply(model$top_terms, 2, function(x){
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)

View(model$summary[ order(model$summary$prevalence, decreasing = TRUE) , ])

# Get topic predictions for all 5,000 documents
# first get a prediction matrix, phi is P(word | topic)
# we need P(topic | word), or "phi_prime"
model$phi_prime <- CalcPhiPrime(phi = model$phi, theta = model$theta)

# set up the assignments matrix and a simple dot product gives us predictions
assignments <- dtm / rowSums(dtm)

assignments <- assignments %*% t(model$phi_prime)

assignments <- as.matrix(assignments) # convert to regular R dense matrix

# compare the "fit" assignments to the predicted ones
barplot(model$theta[ rownames(dtm_sample)[ 1 ] , ], las = 2,
        main = "Topic Assignments While Fitting LDA")

barplot(assignments[ rownames(dtm_sample)[ 1 ] , ], las = 2,
        main = "Topic Assignments Predicted Under the Model")
Depending on your application, you can reformat the outputs of phi, theta, assignments, the summary table, etc. to suit your needs. For example, you can build a "semantic" search of your documents by vectorizing the query with CreateDtm and then predicting under the model with phi_prime.
As of this writing, you will have to take care to make sure your vocabulary aligns. (I'd suggest using something like intersect(colnames(dtm), colnames(phi_prime)).) Part of the big 2018 update will be creating a predict method that will handle this for you.
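A rough, hypothetical sketch of that idea, assuming the model and assignments objects from the LDA example above; the function name search_topics and the choice of Hellinger distance are illustrative, not part of textmineR's API:

# embed a free-text query and rank documents by similarity in topic space
search_topics <- function(query, phi_prime, doc_topics, n = 5) {

  # vectorize the query with CreateDtm, as described above
  q <- CreateDtm(doc_vec = query, doc_names = "query",
                 ngram_window = c(1, 1), verbose = FALSE)

  # align vocabularies, as suggested above
  vocab <- intersect(colnames(q), colnames(phi_prime))

  q <- q / rowSums(q)

  # project the query into topic space with phi_prime
  q_topics <- as.matrix(q[ , vocab , drop = FALSE ] %*% t(phi_prime[ , vocab , drop = FALSE ]))

  # Hellinger distance between the query and every document's topic distribution
  # (this computes the full pairwise matrix, fine for a sketch but wasteful at scale)
  d <- CalcHellingerDist(rbind(q_topics, doc_topics))[ 1 , -1 ]

  sort(d)[ 1:n ] # the n closest documents
}

# example: search_topics("a heartwarming family film", model$phi_prime, assignments)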
Latent semantic analysis (LSA) was arguably the first topic model. LSA was patented in 1988. It applies a singular value decomposition to a document term matrix, TF-IDF matrix, or similar.
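Conceptually, a truncated SVD already produces the three objects textmineR reports for LSA (theta, phi, and the singular values). The toy base-R sketch below is purely illustrative; FitLsaModel's exact scaling conventions may differ.

# a toy 10-document, 20-term matrix of counts
x <- matrix(rpois(200, 1), nrow = 10)

s <- svd(x, nu = 3, nv = 3) # keep 3 "topics"

theta_toy <- s$u      # documents x topics
phi_toy <- t(s$v)     # topics x terms
sv_toy <- s$d[ 1:3 ]  # the singular values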
The workflow for LSA is largely the same as for LDA. There are two key differences: we will use the IDF vector mentioned above to create a TF-IDF matrix, and we cannot get an R-squared for LSA since it is non-probabilistic.
# get a tf-idf matrix
tf_sample <- TermDocFreq(dtm_sample)

tf_sample$idf[ is.infinite(tf_sample$idf) ] <- 0 # fix idf for missing words

tf_idf <- t(dtm_sample / rowSums(dtm_sample)) * tf_sample$idf

tf_idf <- t(tf_idf)

# Fit a Latent Semantic Analysis model
# note the number of topics is arbitrary here
# see extensions for more info
lsa_model <- FitLsaModel(dtm = tf_idf, k = 100)

# three objects:
# theta = distribution of topics over documents
# phi = distribution of words over topics
# sv = a vector of singular values created with SVD
str(lsa_model)

# probabilistic coherence, a measure of topic quality
# - can be used with any topic model, e.g. LSA
lsa_model$coherence <- CalcProbCoherence(phi = lsa_model$phi, dtm = dtm_sample, M = 5)

summary(lsa_model$coherence)

hist(lsa_model$coherence, col = "blue")

# Get the top terms of each topic
lsa_model$top_terms <- GetTopTerms(phi = lsa_model$phi, M = 5)

head(t(lsa_model$top_terms))

# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents.
lsa_model$prevalence <- colSums(lsa_model$theta) / sum(lsa_model$theta) * 100

# textmineR has a naive topic labeling tool based on probable bigrams
lsa_model$labels <- LabelTopics(assignments = lsa_model$theta > 0.05,
                                dtm = dtm_sample,
                                M = 1)

head(lsa_model$labels)

# put them together, with coherence into a summary table
lsa_model$summary <- data.frame(topic = rownames(lsa_model$phi),
                                label = lsa_model$labels,
                                coherence = round(lsa_model$coherence, 3),
                                prevalence = round(lsa_model$prevalence, 3),
                                top_terms = apply(lsa_model$top_terms, 2, function(x){
                                  paste(x, collapse = ", ")
                                }),
                                stringsAsFactors = FALSE)

View(lsa_model$summary[ order(lsa_model$summary$prevalence, decreasing = TRUE) , ])

# Get topic predictions for all 5,000 documents
# first get a prediction matrix
lsa_model$phi_prime <- diag(lsa_model$sv) %*% lsa_model$phi

lsa_model$phi_prime <- t(MASS::ginv(lsa_model$phi_prime))

# set up the assignments matrix and a simple dot product gives us predictions
lsa_assignments <- t(dtm) * tf_sample$idf

lsa_assignments <- t(lsa_assignments)

lsa_assignments <- lsa_assignments %*% t(lsa_model$phi)

lsa_assignments <- as.matrix(lsa_assignments) # convert to regular R dense matrix

# compare the "fit" lsa_assignments to the predicted ones
# distribution is the same, but scale changes
barplot(lsa_model$theta[ rownames(dtm_sample)[ 1 ] , ], las = 2,
        main = "Topic Assignments While Fitting LSA")

barplot(lsa_assignments[ rownames(dtm_sample)[ 1 ] , ], las = 2,
        main = "Topic Assignments Predicted Under the Model")
Document clustering can be thought of as a topic model in which each document contains exactly one topic. textmineR's Cluster2TopicModel function allows you to take a clustering solution and a document term matrix and turn them into a probabilistic topic model representation. You can then use many of textmineR's topic model utilities to evaluate your clusters (e.g. R-squared, coherence, labels, etc.).
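A minimal sketch of that workflow, assuming the dtm_sample and model objects from the LDA example above; the k-means clustering here (on the documents' topic distributions) is purely illustrative:

# cluster the sampled documents, here simply by their LDA topic distributions
cl <- kmeans(model$theta, centers = 10)

# convert the clustering solution into a topic model representation
cluster_model <- Cluster2TopicModel(dtm = dtm_sample, clustering = cl$cluster)

# the usual utilities then apply
cluster_model$coherence <- CalcProbCoherence(phi = cluster_model$phi, dtm = dtm_sample, M = 5)

cluster_model$labels <- LabelTopics(assignments = cluster_model$theta > 0.05,
                                    dtm = dtm_sample, M = 1)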
There is no commonly accepted way to choose the number of topics in a topic model. Fear not! Probabilistic coherence can help you. In forthcoming research, I show that probabilistic coherence can find the correct number of topics on a simulated corpus where the number of topics is known beforehand. (This will be part of a PhD dissertation, sometime around 2021. Stand by!)
Users can implement this procedure themselves: fit several topic models across a range of values for k, calculate the probabilistic coherence of each topic in each model, and then average the probabilistic coherence across all topics in a model. This is similar to using the silhouette coefficient to select the number of clusters when clustering.
Some example code (on a trivially small dataset) is below.
# load a sample DTM
data(nih_sample_dtm)

# choose a range of k
# - here, the range runs into the corpus size. Not recommended for large corpora!
k_list <- seq(5, 95, by = 5)

# set up a temporary directory to store fit models so you get partial results
# if the process fails or times out. This is a trivial example, but with a decent
# sized corpus, the procedure can take hours or days, depending on the size of
# the data and complexity of the model.
# I'm using the digest function to create a hash so that it's obvious this is a
# temporary directory
model_dir <- paste0("models_", digest::digest(colnames(nih_sample_dtm), algo = "sha1"))

if (!dir.exists(model_dir)) dir.create(model_dir)

# Fit a bunch of LDA models
# even on this trivial corpus, it will take a bit of time to fit all of these models
model_list <- TmParallelApply(X = k_list, FUN = function(k){

  filename = file.path(model_dir, paste0(k, "_topics.rda"))

  if (!file.exists(filename)) {
    m <- FitLdaModel(dtm = nih_sample_dtm, k = k, iterations = 500)
    m$k <- k
    m$coherence <- CalcProbCoherence(phi = m$phi, dtm = nih_sample_dtm, M = 5)
    save(m, file = filename)
  } else {
    load(filename)
  }

  m
}, export = c("nih_sample_dtm", "model_dir")) # export only needed for Windows machines

# Get average coherence for each model
coherence_mat <- data.frame(k = sapply(model_list, function(x) nrow(x$phi)),
                            coherence = sapply(model_list, function(x) mean(x$coherence)),
                            stringsAsFactors = FALSE)

# Plot the result
# On larger (~1,000 or greater documents) corpora, you will usually get a clear peak
plot(coherence_mat, type = "o")
Topic models from other packages can be used with textmineR. The workflow would look something like this: use CreateDtm to create a curated DTM, use Dtm2Docs to re-create a text vector of curated tokens from your DTM, fit a model on those tokens with the other package, and then format the result as the theta and phi matrices described above so that textmineR's utility functions can be applied. A hypothetical sketch of that last step is below.
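In this sketch, word_topic_counts (topics x words) and doc_topic_counts (documents x topics) are placeholder names for whatever raw output the other package returns; they are not the output of any particular package.

# normalize raw counts into the two-matrix representation textmineR expects
phi <- word_topic_counts / rowSums(word_topic_counts)   # rows are P(word | topic)

theta <- doc_topic_counts / rowSums(doc_topic_counts)   # rows are P(topic | document)

external_model <- list(phi = phi, theta = theta)

# provided phi's columns line up with the DTM's columns, utilities apply as usual
external_model$coherence <- CalcProbCoherence(phi = external_model$phi, dtm = dtm, M = 5)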
Text embeddings are particularly hot right now. While textmineR doesn't (yet) explicitly implement any embedding models like GloVe or word2vec, you can still get embeddings. Text embedding algorithms aren't conceptually different from topic models. They do, however, operate on a different matrix: instead of reducing the dimensions of a document term matrix, text embeddings are obtained by reducing the dimensions of a term co-occurrence matrix. In principle, one can use LDA or LSA in the same way. In that case, the rows of theta are embedded words, and a phi_prime may be obtained to project documents or new text into the embedding space.
What follows is a quick example of this using LDA as the embedding mechanism.
# First create a TCM using skip grams, we'll use a 5-word window
# most options available on CreateDtm are also available for CreateTcm
tcm <- CreateTcm(doc_vec = movie_review$review,
                 skipgram_window = 5)

# use LDA to get embeddings into probability space
# This will take considerably longer as the TCM matrix has many more rows
# than a DTM
embeddings <- FitLdaModel(dtm = tcm,
                          k = 100,
                          iterations = 800)

# Get an R-squared for general goodness of fit
embeddings$r2 <- CalcTopicModelR2(dtm = tcm,
                                  phi = embeddings$phi,
                                  theta = embeddings$theta)

embeddings$r2

# Get coherence (relative to the TCM) for goodness of fit
embeddings$coherence <- CalcProbCoherence(phi = embeddings$phi, dtm = tcm)

summary(embeddings$coherence)

# Get top terms, no labels because we don't have bigrams
embeddings$top_terms <- GetTopTerms(phi = embeddings$phi, M = 5)

head(t(embeddings$top_terms))

# Create a summary table, similar to the above
embeddings$summary <- data.frame(topic = rownames(embeddings$phi),
                                 coherence = round(embeddings$coherence, 3),
                                 prevalence = colSums(embeddings$theta) / sum(embeddings$theta) * 100,
                                 top_terms = apply(embeddings$top_terms, 2, function(x){
                                   paste(x, collapse = ", ")
                                 }),
                                 stringsAsFactors = FALSE)

View(embeddings$summary[ order(embeddings$summary$prevalence, decreasing = TRUE) , ])

# Embed the documents
dtm_embed <- CreateDtm(doc_vec = movie_review$review,
                       doc_names = movie_review$id,
                       ngram_window = c(1, 1))

dtm_embed <- dtm_embed[ , colnames(tcm) ] # make sure vocab lines up

embeddings$phi_prime <- CalcPhiPrime(phi = embeddings$phi,
                                     theta = embeddings$theta)

embedding_assignments <- dtm_embed / rowSums(dtm_embed)

embedding_assignments <- embedding_assignments %*% t(embeddings$phi_prime)

embedding_assignments <- as.matrix(embedding_assignments)

# get a goodness of fit relative to the DTM
embeddings$r2_dtm <- CalcTopicModelR2(dtm = dtm_embed,
                                      phi = embeddings$phi,
                                      theta = embedding_assignments)

embeddings$r2_dtm

# get coherence relative to DTM
embeddings$coherence_dtm <- CalcProbCoherence(phi = embeddings$phi, dtm = dtm_embed)

summary(embeddings$coherence_dtm)
You could just as easily use LSA as your embedding, or any other dimensionality reduction/matrix factorization method. The advantage of using the dgCMatrix class is that it is so widely supported in the R ecosystem.
Embeddings are a relatively recent area of research. However, they may be used in contexts very similar to topic models; it's just that the "topics" are fit another way.
Let's use the above embeddings model to create a document summarizer. This will return the three most relevant sentences in each review.
library(igraph)

# let's do this in a function
summarizer <- function(doc, phi_prime) {

  # recursive fanciness to handle multiple docs at once
  if (length(doc) > 1 )
    # use a try statement to catch any weirdness that may arise
    return(sapply(doc, function(d) try(summarizer(d, phi_prime))))

  # parse it into sentences
  sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]

  names(sent) <- seq_along(sent) # so we know index and order

  # embed the sentences in the model
  e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE)

  # remove any documents with 2 or fewer words
  e <- e[ rowSums(e) > 2 , ]

  vocab <- intersect(colnames(e), colnames(phi_prime))

  e <- e / rowSums(e)

  e <- e[ , vocab ] %*% t(phi_prime[ , vocab ])

  e <- as.matrix(e)

  # get the pairwise distances between each embedded sentence
  e_dist <- CalcHellingerDist(e)

  # turn into a similarity matrix
  g <- (1 - e_dist) * 100

  diag(g) <- 0

  # turn into a nearest-neighbor graph
  g <- apply(g, 1, function(x){
    x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
    x
  })

  g <- pmax(g, t(g))

  g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)

  # calculate eigenvector centrality
  ev <- evcent(g)

  # format the result
  result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ]

  result <- result[ order(as.numeric(names(result))) ]

  paste(result, collapse = " ")
}

# Let's see the summary of the first couple of reviews
docs <- movie_review$review[ 1:3 ]

names(docs) <- movie_review$id[ 1:3 ]

sums <- summarizer(docs, phi_prime = embeddings$phi_prime)

sums